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Abstract 

High-dimensional sparse modeling with censored survival data is of great practi- 
cal importance, as exemplified by modern applications in high-throughput genomic 
data analysis and credit risk analysis. In this article, we propose a class of regu- 
larization methods for simultaneous variable selection and estimation in the additive 
hazards model, by combining the nonconcave penalized likelihood approach and the 
pseudoscore method. In a high-dimensional setting where the dimensionality can grow 
fast, polynomially or nonpolynomially, with the sample size, we establish the weak or- 
acle property and oracle property under mild, interpretable conditions, thus providing 
strong performance guarantees for the proposed methodology. Moreover, we show that 
the regularity conditions required by the Li method are substantially relaxed by a 
certain class of sparsity-inducing concave penalties. As a result, concave penalties such 
as the smoothly clipped absolute deviation (SCAD), minimax concave penalty (MCP), 
and smooth integration of counting and absolute deviation (SICA) can significantly im- 
prove on the Li method and yield sparser models with better prediction performance. 
We present a coordinate descent algorithm for efficient implementation and rigorously 
investigate its convergence properties. The practical utility and effectiveness of the 
proposed methods are demonstrated by simulation studies and a real data example. 
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1 Introduction 



Advances in experimental technologies in molecular biology during the past decade have 
brought in a wealth of biomedical data. For instance, DNA microarrays now can be used 
to measure the expression of tens of thousands of genes in a sample of cells or to identify 
hundreds of thousands of single nucleotide polymorphisms for an individual at the same 
time. Data of this kind pose tremendous challenges to effective statistical inference, since 
the number of features, p, is large compared to the number of observations, n, in which case 
many classical inference methods can easily fail or become inapplicable. Variable selection, 
a powerful tool for sparse modeling, is a fundamental task in high- dimensional regression 
problems, which aims to select only a small set of important variables from a huge number 
of features, in the hope of alleviating the overfitting problem in high dimensions and im- 
proving the predic tive power and interpretability of the resulting model. See, for example. 



Fan and Lvl (120101 ) for a review of recent developments in high- dimensional variable selection. 



When clinical data on patient survivals are also available, it would be informative to link 
high-dimensional biomedical data to survival outcomes. A number of efforts have recently 
been made in this direction. Regularization methods, which can yield sparse models and 
hence perform simultaneous variable selection and estimation, are particularly useful and 
have gained increasing popularity. Several regularization methods o riginally develope d for 
linear regre s sion h ave been adapted to survival models. For example, iTibshiranil (119971 ) and 
Fan and Lil hoO'j ] ext ended the Lasso and n onco ncaye pena lized likelihood, respectively, to 



the Cox model, while 



Zhang and Lul (120071) and IZoul (120081 ) developed weighted Li meth- 



ods for the Cox model. ICai et al 



(120051) were the first to study regularization methods for 
survival models in a framework with p growing with n; they demonstrated that the noncon- 
cave penalized pseudo-partial likelihood estimato r for multivariate failure time data enjoys 
the oracle property when p grows slowly with n. lAntoniadis. Fryzlewicz. and Letuel (120101 ) 
studied the Dantzig selector for the Cox model in a high- dimensional setting, but did not 
address the issue of model selection consistency. In addition to their classical applications 
for survival analysis in public health, survival models have be en widely used to model time- 



to-ev ent data for credit risk analysis in finance and economics ( Jarro 
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2009; 



Fan. Lv. and Qi 



201ll ). Identifying important risk factors and quantifying their contributions are crucial aims 



of these problems. 

Variable selection techniques for survival data have also been extended beyond the Cox 
model. As a useful alternative to the Cox model, the additive hazards model assumes that 
the hazard function of a failure time T conditional on a p- vector of possibly time-dependent 
covariates Z(-) takes the form 

A(t|Z) = Ao(t)+/3^Z(t), 



T. 
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where Ao(-) is an unspecified ba seline hazard function and /3 n is a p -vector of regressio n 
coefficients ( Cox and Oakes 1984 . p. 74: Isreslow and Day 1987 . p. 182; Lin and Ying 1994 ). 
The additive models provide a characterization of the regression effects different than the 
multiplicative models and have some remarkable features that are not shared by the latter. 
In particular, model ([T]) pertains to the risk difference, or excess risk, a measure that is 
especially relevant and informative in epidemiological and clinical studi es. Variable se l ection 
in model ([1]) has been studied by a nu mber of authors; for example. iLeng and Mai (120071 ) 
proposed a weighted Li approach and Martinussen and Scheike ( 20091 ) considered several 
regularization methods including the Lasso and Dantzig selector. 

Despite the aforementioned developments, a rigorous high- dimensional theory that can 
provide strong performance guarantees for regularization-based variable selection methods 
in the survival setting is still lacking. Specifically, it is unclear how high dimensionality 
these methods can handle and what conditions are required for obtaining model selection 
consistency and nice sampling properties. The need for the development of such a theory 
is urgent, in view of the recent advances in understanding the perfor mance of regulariza- 
tion methods in the linear reg^ressiq i i and classificatio n contexts (e.g. 

Wainwrightlbood l 



Fan and Fan 


2008; 


Lv and Fan 


2009; 



Zhao and Yu 



2006 



In this article, we propose a class of regularization methods for simultaneous variable 
selection and estimation in model ([1]), by combining th e nonconcave pena lized likelihood 
approach ( Fan and hjboOll ) and the pseudoscore method ( Lin and Ying 1994 ). To justify the 
superior performance of the proposed methodology, we consider a high-dimensional setting 
where the dimension of covariates can grow fast, possibly nonpolynomially, wit h the sample 
size. Under mild, interpreta ble conditions, we establish the weak oracle property (ILv and Fan 



20091 ) and oracle property (IFan and Lill200ll ) of the proposed regularized estimators. Our 



high-dimensional analysis is innovative in that it involves empirical process techniques that, 
to the best of our knowledge, have not been previously used in the survival analysis literature, 
and provides new insights into the model selection properties of regularization methods for 
survival data. In particular, we show that the regularity conditions required by the Li 
method are substantially relaxed by a certain class of sparsity-inducing concave penalties, 
which includes some commonly used concave penalties as special cases (see Section 12. 2p . 
Furthermore, we present a coordinate descent algorithm for efficient implementation and 
rigorously investigate its convergence properties. The practical utility and effectiveness of 
the proposed methodology are demonstrated by both si mulated and real data. 

In an independent work closely related to this article, iBradic. Fan, and Jiand (120111 ) stud- 
ied regularized estimation for variable selection in the Cox model and obtained important 
oracle-type theoretical results in which the dimension of covariates may grow nonpolynomi- 
ally with the sample size. Besides model assumptions, a critical difference from our results. 
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however, is that they imposed a random condition on a large empirical covariance matrix; see 
their condition 8. Thus, it is natural to ask the question whether the regularized estimators 
still enjoy the desired properties if a similar condition is imposed on the population version 
of the matrix. Since the empirical covariance matrix involves the outcome variables, as is 
generally the case for survival models, a nonrandom condition on the population covariance 
matrix seems more natural and will provide more confirmative performance guarantees. Such 
conditions will also have the benefit that they can be viewed as high-dimensional extensions 
of the classical asymptotic regularity conditions in the low- dimensional setting, which are 
imposed on the population covariance matrix. We will provide an affirmative answer to this 
important question. 

The remainder of this article is organized as follows. In Section [21 we propose a class of 
regularization methods and discuss choices of the penalty function. The theoretical prop- 
erties of these regularized estimators are studied in Section |3l In Section HI we describe 
a coordinate descent algorithm, study its convergence properties, and discuss selection of 
tuning parameters. Simulation studies and a real data example are presented in Sections [5] 
and El respectively. Some discussion is provided in Section [71 and all proofs and technical 
details are relegated to the Appendices. 

2 Regularization Methodology 
2.1 Regularized Estimation 

We begin with the problem setup. Let T be the failure time and C the censoring time. 
Denote the censored failure time by X = T A C and the failure indicator by A = I(T < C), 
where /(■) is the indicator function. Let Z(-) = {Zi{-), ■ ■ ■ , Zp{-)) be a vector of predictable 
covariate processes and assume that T and C are conditionally independent given Z(-). 
The observed data consist of (Xj, Aj, Zj(-)), i = 1, ■ ■ ■ ,n, which are independent copies of 
{X, A, Z(-)). We assume that the conditional hazard function is given by model ([T|). 

We adopt the usual counting process notation. Define the observed-failure counting 
process Ni{t) = I{Xi < t, Aj = 1), the at-risk indicator Yi{t) = I{Xi > t), and the counting 
process martingale 



M,(t) = N,{t) - [' Y,{s){Xo{s) + (3^Z,{s)}ds. 
Jo 



We will also use N{t), Y{t), and M(t) to denote the generic versions of these processes 



The pseudoscore function of 



Lin and Ying is defined 



as 



U(/3) = - E / {Ht) - Zit)}{dN,it) - F,(t)/3^Z,(t) dt}, 
where (3 G MP, Z{t) = ^•(t)Zj(t)/ ^"^^^ yj()f:), and r is the maximum follow-up time. 
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This estimating function is linear in /3; through some algebraic manipulation, we can write 
U(/3) = b - V/3 with 

b=-V / {Z,{t)-Z{t)}dN,{t) 



and 



"^T^ Jo 



(2) 



where v®^ = vv'^ for any vector v. Since V is positive semidefinite, integrating — U(/3) with 
respect to (3 leads to the least-squares-type loss function 



L(/3) = -/3^V/3 - b^/3. 



(3) 



Using this loss function for regularized est i matio n in model (U) has been suggested by a 
number of authors, including Leng and Mai ( 2007 ) and Martinussen and Scheike ( 2009 ): the 
latter authors noted also that L{I3) can be interpreted as an empirical prediction error, up 
to a constant, for the part of the model orthogonal to the at-risk indicator. 

We now define the regularized estimator /3 as a solution to the regularization problem 



3 = argmin|Q(/3) ^ L{(3) + Vpa(|/3,|)1, 



(4) 



where (3 = ■ ■ ■ ,/3p)'^ and px{0), > 0, is a penalty function that depends on the 
regularization parameter A > 0. When the minimization problem is nonconvex, we will 
consider a local minimizer, as is common in the literature. It is often convenient to rewrite 
the penalty function as px{-) = ^P\{-)] we write Pa(-) as p(-) when it is free of A. Without 
the penalty term, f3 reduces to the pseudoscore estimator of Lin and Ying ( 1994 ). When 
the dimensionality is high, however, some form of regularization is needed to guard against 
overfitting, and the performance of the regularized estimator depends critically on the choice 
of the penalty function. Thus, in what follows we will first define a general class of penalty 
functions and discuss several popular choices among the class, and then present some theory 
to gain further insight into these choices. 

2.2 Penalty Function 

To answer t he qu estion on what kind of penalty functions are ideal for model selection. 



Fan and Lil ( 120011 ) advocated penalty functions giving rise to estimators with three desired 
properties: sparsity, unbiasedness, and continuity. These properties motivate consideration 
of a class of penalty functions that satisfies the following condition. 

Condition 1. The function p\{6) is increasing and concave in G [0, oo), and has a continu- 
ous derivative p'x{0) on (0, oo). In addition, p'x{0) is increasing in A and Pa(0+) = p'(0+) > 
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is independent of A. 



Some intuition for Condition [T] is as follows. The singularity at the origin encourages 
sparsity; the concavity assumption aims to reduce the estimation bias; the requirement that 
p'x{(^) is increasing in A allows A to effectively control the overall strength of the penalty. 
It should be noted that we do not require strict concavity or monotonicity, so that a wide 
range of penalty functions, including those that do not lead to all of the aforementioned three 
properties, are included in this class, which will facilitate our comparisons among different 



penalty functions. In the contexts of 
tions has been studied by lLv and Fan 
are the following examples. 



genera lized ) linear mod e 



fl2009h and 



s, th is class of penalty func- 



Fan and Lvl (120111 ). Of particular interest 



The Lasso (ITibshiranil Il996l ) uses the Li-penalty, i.e., p{9) = 6, 6 > 0. 



The smoothly clipped absolute deviation (SCAD) penalty (IFan 
is given by the derivative 



1997 



Fan and Li 



20011 ) 



e>o, 



(a - 1)A 

where a > 2 is a shape parameter. This penalty function takes off at the origin as the 
Li-penalty and then gradually levels off until its derivative reaches zero. 



The minimax concave penalty (MCP) proposed by [Zhang (120101 ) has the derivative 

{aX - e)+ 



aX 



9>0, 



where a > 1 is a shape parameter. In a similar spirit to SCAD, this penalty function 
gradually decreases its derivative to zero, except that it drops the Li part of SCAD. 



The s mooth integration of counting and absolute deviation (SICA) penalty (ILv and Fan 
20091 ) takes the form 

a + 1)9 



P{d) 



a + e 



> 0, 



(5) 



where a > is a shape parameter. With a varying from to oo, this family provides 
a smooth homotopy between the Lq- and Li-penalties. Each penalty function starts 
with slope 1 + a~^ at the origin, passes through the point (1, 1), and decreases its slope 
toward zero over the interval [0, oo). 

The Li-penalty is a convex relaxation of the Lo-penalty and falls at the boundary of 
the class of penalty functions that satisfies Condition [H Although the Li-regularization 
method enjoys the advantage of computational simplicity, it can suffer from several draw- 
backs which have motivated a number of improvements. The SCAD penalty was originally 
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proposed to alleviate the bias caused by the Li approach, and has been shown to possess the 
oracle property, i.e., the resulting estimator performs asymptotically as well as the oracle 
estimator which knew the true sparse model in advance. The estimation bias of the Lasso 
can also interfere with var iable selection; as a result, more stringent conditions such as the 
irrepresentable condition (jZhao and Ydl2006l ) are typically required for consistent variable 
selection. The advantages of concave penalties regarding mode l select i on con sistency have 
recently been revealed and justified by a number of authors. IZhangI ( 120101 ) showed that 
the MCP penalty has certain minimax optimality which enables it to strike a balance be- 
tween the superior theoretical properties of concave penalties and the computational cost of 
nonconvex regular ization problems. By investigating a nonasymptotic weak oracle property. 



Lv and Fan 



(120091 ) showed that the regularity conditions needed for the Li approach can be 
substantially relaxed by using concave penalties. The SICA family proposed in that article 
has the remarkable feature that it can be used to define a sequence of regularization problems 
with varying theoretical performance and computational complexity. 

3 Theoretical Properties 

Besides the choice of the penalty function, the performance of the regularized estimators 
depends on a variety of factors, such as the dimensionality of the model, the dependency 
among the covariates, and the choice of the regularization parameter. In order to deter- 
mine how these factors interact with each other and together affect the performance of the 
proposed estimators, in this section we rigourously develop a high-dimensional theory and 
discuss some of its implications. 

We begin by introducing some notation to be used in our theoretical results. For any 
vector V, recall that v®^ = vv^ and for notational convenience, we write = 1 and 
v®^ = V. Define 

s'^''\t) = E{Y{t)Z{tf''}, fc = 0,l,2, 
e(t) = s«(t)/sW(t), 



(6) 



and 



B = E 



E 



{Z(t) -e(t)}®2 ^jy(^) 



It is worthwhile to note that D is the population counterpart of the matrix V defined in ([2]) 
while 5] is the population counterpart of the matrix 



W = - V Hz^it) - Z{t)}^'dN,{t). 
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These matrices characterize the covariance structure of the model and will play a key role 
in our high- dimensional analysis. 

Furthermore, define the active set A = {j : f3oj ^ 0}, where /3oj, 1 < j < is the 
jth component of the true regression coefficient vector /3q. Let s = \A\, i.e., the number of 
nonzero coefficients in /3q, and we allow the dimension triple (s, n,p) to vary freely. Similarly, 
define the estimated active set A = {j : Pj ^ 0}, where /3 = , Pp)'^ ■ Denote the 

complement of a set B by B'^. We will use sets to index vectors and matrices; for example, 
/3oyi is the vector formed by the components /3oj of /Sg with j G A, and D^^a is the matrix 
formed by the entries Dij of D with i E A'^ and j E A. Define the (half) minimum signal 

d = ^mm\f3oj\. 



For any = {9i, ■ ■ ■ , Og)^ E W with ^ for all j, following ILv and Fan! t00% . define the 
local concavity of the penalty function px{-) at point as 



Finally, define 



k(Pa; ^) = lim max sup fP^M^AM 

e^0+ l<j<q \e^\-e<ti<t2<\ej\+e [ ^2 - ^1 



Ko = sup{k{px] e): e eR',\\0 - /3oa||oo < d}, 



and 



P — Ainin(DAyl) — Xkq, 

where A^[^{-) denotes the minimum eigenvalue. It is important to note that all the quantities 
defined above can depend on the sample size n, and we have suppressed this dependency for 
notational simplicity. 

3.1 Weak Oracle Property 



Lv and Fan 



( I2OO9I ) introduced the concept of weak oracle property for comparing different 
regular izat ion methods. An estimator is said to have the weak oracle property if it is both 
consistent in model selection and unifor mly consistent in e stimation. This notion is weaker 



than the oracle property introduced by iFan and Lil (j200l[ ) and hence can be satisfied by a 



broader class of estimators. To derive a nonasymptotic result regarding the weak oracle 
property of the proposed estimators, we need to impose the following conditions. 



Condition 2. (i) Xo{t) dt < oo. (ii) P{Y{t) = 1} > 0. (ni) There exist constants 

p( sup \Zj{t)\ >x) < Dexp{-Kx'') 



D, K,r > such that 
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for all X > and j = 1, ■ ■ ■ ,p. (iv) The sample paths of Zj{-), j = 1, ■■■ ,p, are of uniformly 
bounded variation. 



Condition 3. There exist constants a G (0, 1], 7 G [0, 1/2], and c > such that 

llD^.^D^^ilU < {(1 - ^ (c^")- 

In Condition [21 parts (i) and (ii) are standard for survival models; part (iii) controls the 
tail behavior of the covariates and is trivially satisfied for bounded covariates; part (iv) is a 
very mild technical condition that will facilitate ent ropy calculat i ons. 

Condition [3] is an analog of condition (35) in Lv and FanI ( 2009) for p enaliz ed least 
squares, which is in turn a generalization of condition (15) in IWainwrightl (120091 ) for the 
Lasso. Often for linear regression, such conditions are first imposed on the deterministic 
Gram matrix, and then a variety of random design matrices such as Gaussian ensembles 
can be further considered. For survival models such as model ([1]), however, there is no 
exact analog of the deterministic Gram matrix; here the matrix V, which plays the same 
role as the Gram matrix in linear regression, involves the at-risk indicators and hence is 
nondeterministic. Thus, Condition [3] is imposed on the population version D of V. Note 
also that we are not restricted to the cases where the covariates are bounded or Gaussian. 

The right-hand side of Condition [3] consists of two parts: the first part is an upper bound 
that reflects the intrinsic capability of the penalty function for variable selection; the second 
part is at most 0{^/n), where the parameter 7 needs to be determined by other conditions 
to be presented later. For the Li-penalty, the first part is bounded by constant 1, which is 
stringent; for concave penalties, the upper bound is generally relaxed, since concavity implies 
that p'x{9) is decreasing in 9 and thus p' (0+) / p\{d) can diverge asymptotically. When signals 
are fairly strong so that d ^ X, the first part imposes no constraint for the SCAD and MCP 
penalties, since p'x{d) = in that case. Also, the upper bound for the SICA penalty can be 
substantially relaxed by choosing a small value of a. 

Since Condition |3] and definitions of if and p involve the matrices 'Da<=aDaa^ ^aa^ 
Daa, a key step to establishing the weak oracle property is to show that the empirical 
counterparts of these matrices are close to them in some sense. This intermediate result is 
provided by the following lemma, which gives explicit probability bounds for similar con- 
ditions to hold for the empirical matrices. In what follows, let Ql denote the event that 
max^^j^ supjg[Q^^] l^il'^)! < for L > 0. 

Lemma 1 (Concentration of empirical matrices). Under Conditions [2H3, if p > and 

(f V p^^ = 0{^Jn/s), then there exist constants D,K > such that 

P(||V:^^|U > 2cp I ^l) < .^Dexpj-ir-^ (^-1^ A 1^ |, (7) 
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p 



-1 



> 



a\ p'(0+) 



2 J p'.id) 



A (2cn^) I 



< {p — s)sD exp 



A 1 



(8) 



L4 \^(^252 



A 1 



and 



P(Amin(V^^) < A/€o I ^^l) < D -K 



L4 V s2 



A 1 



(9) 



Inequalities ([7]) and ([8]) show that there would not be much difference if we had defined 
the quantity if or imposed Condition [3] on the empirical matrices. The eigenvalue condition 
Amin(V^A) > Ako is needed for identification of a strict local minimizer of problem ([1]); 
inequality ([9]) says that this condition holds with high probability if Ajam(DAA) and Akq 
have a positive gap fi that does not shrink to zero too fast. 

We now state our main theoretical result regarding the weak oracle property of the 
proposed estimators. 

Theorem 1 (Weak oracle property). In addition to ConditionslJ\{^ assume that the follow- 
ing conditions hold: 



n{p'^{d)-^ A 
ip'^s'^ (log pY^ 
nA2 



— > oo, 



n{ip ^ A fiY 



— !■ OO, 



n 



(logp) 



ri 



— )■ CXD, 



(log s) 



— )■ OO, 



and 



(10) 
(11) 

(12) 



d > ciipXp{0+), 

where > 0, ri = (r + 4)/r, and Ci = 2 + l/(4c). Then, for some constants D,K > 0, with 
probability at least 



D exp 



-Kn' 



iM/ (v^''A/i)- 



A 1 



l/n-i 



\2 \ l/'f-i 
Dexp^-irnV-(_Al 



^ 1, 



there exists a regularized estimator (3 that satisfies the following properties: 

(a) (Sparsity) /3^c = 0. 

(b) (L^-loss) - /^oaIIoc < Ciy?Ap'(0+). 

To develop intuition for the two conditions in ([TUj) . we consider some simplified cases. 
First, concavity implies that p'x{d) < p'(0+); thus, a sufficient condition for the first condition 
in ([TOD to hold is 

^oo. (13) 



Lp'^s'^ilogpY 
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Consider the second condition in (fTOj) and recall that /i = Ajnm(D^A) — Akq- For the Li- 
penalty, kq = 0; for SCAD and MCP, Akq = (a — 1)^^ and a^^, respectively. Thus, for this 
condition to hold for these penalties, it suffices to (Daa) is bounded away 

from zero and that 

n 

oo, 



where the latter is implied by f|T3|) . Therefore, conditions in f fTOj) are primarily constraints on 
the growth rates of the model dimensions p and s and certain matrix norms ofD^^. On the 
other hand, if we assume, for simplicity, that (p is constant, then (fT3ll gives a lower bound for 
the number of observations that are needed for guaranteed sparse recovery, n ^ s"^ {log pY^. 
This is an interesting setting, since it shows that the proposed estimators can handle a 
nonpolynomially growing dimension of covariates as high as logp = o(n^/'"^), while the 
dimension of the true sparse model grows as s = o{^/n). In particular, for bounded covariates, 
we can take ri = 1 by letting r — )■ oo and thus allow logp = o{n). 

For simplicity, consider the case of bounded covariates, i.e., ri = 1. The two conditions 
in ( ITTl) give a lower bound for the regularization parameter A, 



/logo / logs 



Thus, in view of (fT2|) . we see that A should be chosen to satisfy 



logp ^ }ogs_ ^^^^ d 



n V ni-2T - ciipp'{0+)' 
For such choices of A to exist, the minimum signal d must satisfy 



log p / log s 



Recall that 7 G [0,1/2] has appeared in Condition [3l More insight can be gained by 
comparing the two parts on the right-hand side of f|T^ : the first part will dominate if 
n'~' ^ a/ (log /(logs), and in this case. Theorem [1] guarantees recovery of signals that sat- 
isfy d ^ (logp)/n, independent of 7; otherwise, the second part will dominate, and the 
weakest recoverable signal will depend on the correlation between the two sets of true vari- 
ables and noise variables as refiected by the value of 7. Of course, for the Li-penalty, since 
the first part in Condition |3] always dominates the second part, we can simply take 7 = 0, 
and thus the value of 7 plays no role in determining the lower bound for d. 

Despite the similarities b e tween the results presented h ere and those for (generalized) 
linear models in iLv and FanI ( l2009l ) and iFan and Lvl ( 1201 ll ). it is worthwhile to note some 



important differences. The restriction on the correlation structure described in Condition [3] 
has a more complex form, in that the matrix D involves not only the covariates but also the 
failure process and censoring mechanism. This complexity arises from the semiparametric 
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nature of survival models and the presence of censoring, and as a consequence, the ability 
of performing variable selection is affected by the interplay of several factors, including the 
covariates, baseline hazard, and censoring mechanism. Moreover, although our results allow 
the dimensions to grow at rates comparable to those available for linear regression models, 
our proofs suggest that the necessary sample size for observing the effects of these growth 
rates, which is determined by the constants, may be significantly larger. In addition, there 
is a rate loss in characterizing the convergence of the matrix V to its population counterpart 
D. These facts call for a need to increase the sample size for making reliable inference. 

3.2 Oracle Property 

In addition to model selection consistency, the oracle property requires the regularized esti- 
mator to be asymptotically as efficient as the oracle estimator with the true sparse model 
known a priori. For this purpose, some extra eigenvalue conditions are needed. Define 
Ai = Ainin(DyiA), ^2 = Amini^AA), and A3 = Aram{'D^\'S aa'D^\) ■ The oracle property of 
the proposed regularized estimators is stated in the following result. 

Theorem 2 (Oracle property). Assume that all conditions of TheoremU\hold. In addition, 
assume that 

nAj nAl nAfAa 



s2(log sY^ 



— )■ oo, 



— )■ oo, 



— )■ oo, 



and 



A?A3 



^0, 



(15) 



(16) 



where ri = (r + 4)/r. Then, for some constants D,K > 0, with probability at least 



D exp 



(V?-^ A /i A Ai)^ 



A 1 



l/ri 



D exp 



1/ri 



1, 



there exists a regularized estimator (3 that satisfies the following properties: 

(a) (Sparsity) (3j^c = 0. 

(b) (Asymptotic normality) For every u with ||u||2 = 1, 



V 



OA) 



iV(0,l). 



The three conditions in (1151) relate the sparsity dimension s to eigenvalue bounds for the 
matrices D^^, TIaa, and D^^Syi^D^^. If we consider the special case where the eigenvalues 
of these matrices are all bounded away from zero, then these conditions are trivially satisfied 
for s 



Gin 



l/3^ 



The form of f|T5|) . however, deals with much more general situations and 
is very meaningful in that the eigenvalue bounds are closely related to the difficulty of the 
estimation problem. 

In the context of linear re gression, it is well known that the Li-penalty does not have 



the oracle property (IZou 



2006 



Wainwrightl l2009l ) . It is clear from ( fT6l) that this is also the 
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case for the problem considered here. To see this, consider the case where Ai and A3 are 
fixed, and note that p'{d) = 1 for the Li-penalty. Conditions in f lTTj) imply that riA^ — t- 00, 
and hence (fT6|l cannot hold. For the SCAD and MCP penalties, (fT6|) is trivially satisfied for 
(i ^ A, since p'x{d) = in that case. For the SICA penalty, we have p'{d) = a(a + l)/(a + (i)^; 
thus, in order to obtain the oracle property, we should take a — > 0+ at a rate such that 
a and y/ns\a/d^ — )■ 0. This result is reasonable since the SICA penalty approaches the 
Lo-penalty as a — )■ 0+. 



4 Implementation 

In this section, we describe an efficient coordinate descent algorithm for implementation of 
the proposed methodology, analyze its convergence properties, and discuss the selection of 
tuning parameters. 

4.1 Coordinate Descent Algorithm 



he id ea o f coordinate optimization for pena l ized le ast-squares problems was proposed by 



( 1998 ) and 
fcoOTl l and 



Fu 



Friedman et al 



Daubechies. Defrise. and De Moll ( l2004l ) , and was demonstrated by 
Wu and Lang el J2008[ l to be exc eptionally efficient f o r large-scale sparse pr o blems . 



Fan and Lv 



Jmil), 



Breheny and Huang 



Jmil) 



Rec ently, a number of authors, including 
and iMazumder. Friedman, and Hastid (120111 ) , generalized this idea to regularized regression 
with concave penalties and showed that it is an attractive alterna tive to earlier proposals 



such as the loca l quadratic approx imation (LQA) ( iFan and Li 
imation (LLA) Jzou and Lilboosl ). 



2OOII) and local linear approx- 



To balance the regularization strengths on different components of /3, we minimize the 
weighted version of the objective function, 



i=i 

where, for simplicity, we assume that Vjj 7^ for all j. The coordinate descent method min- 
imizes the objective function in one coordinate at a time and cycles through all coordinates 
until convergence. To produce a solution path over a grid of values of the regularization 
parameter A, at each grid point the solution from a neighboring point is used as a warm 
start to accelerate convergence and, for concave penalties, to avoid suboptimal local solu- 
tions. Pick Amax sufficiently large such that /3 = is a (local) solution and take a decreasing 
sequence (Ai, ■ ■ ■ ,\k) with Ai = Amax! for the Lasso, SCAD, MCP, and SICA, one can 
take Amax = i^cix^^^(|6j|/V^j), where hj is the jth component of b. The following algorithm 



produces a solution path (3 over a grid of points A^, k 



K. 



Coordinate Descent Algorithm. 
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1. Initialize f3 = and set A; = 1. 

2. Cyclically for j = 1, ■ ■ ■ ,p, update the jth component f3j of f3 by the univariate (global) 
minimizer of Q{f3; Xk) with respect to /3j until convergence. 

3. Set 3^' = 3 and A; ^ A; + 1. 

4. Repeat Steps 2 and 3 until k = K + 1. 

The above algorithm requires efficiently solving a univariate regularization problem in 
Step 2. Since L{f3) is quadratic, closed-form solutions to this univariate problem exist 
for commonly used penalty functions, including the Lasso, SCAD, MCP, and SICA. The 
solutions for the first three penalties have been given in the aforementioned references, and 
we provide the solution for the SICA penalty in Appendix [Bl Coordinate descent may be 
slow if the resulting model is not sparse; hence, to save computation time, one can set a level 
of sparsity for early stopping if only models up to a certain size are desired. 

4.2 Convergence Analysis 

Denote by Z?™" = ■■ ■ the mth update vector generated by coordinate descent, 

i.e.. 



37 = argminQ(/3r,--- ,/3r-i,^,/5r+l\--- ,/5r') 



for j = 1, ■ ■ ■ ,p and m = 1, 2, ■ ■ ■ , where /3° = ■ ■ ■ , /3p)'^ is the starting point and we 
have suppressed the dependence of Q{f3) on A. Define the maximum concavity of the penalty 
function px{-) by 

k{px) = sup <^ 

0<ti<t2<oo L ^2 ~ tl 

For the Li-penalty, SCAD, MCP, and SICA, we have k{px) = 0, (a-l)~^ a"^ and 2A(a"^ + 
a^^), respectively. Due to the concavity of px{-), the objective function Q{j3) is generally 
nonconvex. However, under certain conditions such that the concavity of px{-) is dominated 
by the convexity of the quadratic loss function L{I3) componentwise (subject to normalization 
by Vjj), Q{(3) can still be strictly convex along each coordinate. This key observation leads 
to the following convergence result. 

Theorem 3 (Convergence of coordinate descent). Under Condition^ the sequence {/3™'} 
generated by the coordinate descent algorithm is bounded. Moreover, the following statements 
hold: 

(a) If the penalty function px{-) satisfies K,{px) < 1; then every cluster point of {Z?™"} is a 
local minimizer of Q{/3). 

(b) If in addition, the sequence eventually lies in a compact neighborhoods of (3* such 
that (3* is the unique local minimizer of Q{f3) in )C, then the sequence {/3™} converges 
to 13*. 



14 



Note that the condition k{px) < 1 is always satisfied by the Li-penalty, SCAD (a > 2), 
and MCP (a > 1). For the SICA penalty, the condition is satisfied if A(a^^ + a^^) < 1/2. 
This latter condition suggests that no matter what value A takes, one can adjust the value 
of a to ensure computational stability. Since the optimal A is often small, one can set a to a 
small value to achieve good performance of the SICA method. 

It is often useful to determine when a local minimizer f3 of Q{f3) is also a global min- 
imizer and when the sequence converges to it. The (restri cted) global o ptima lity of 



nonconcave penalized likelihood estimators was characterized by iFan and Lvl ( 1201 ll ). The 
following result gives sufficient conditions for the (restricted) global optimality of (3 on some 
subspace of MP and the convergence of {/S*"} to the global optimum. 

Theorem 4 (Restricted global optimality). Let (3 he a local minimizer ofQ{f3). For any 
subset S of {!,■ ■ ■ ,p}, denote by Bs the \S\- dimensional subspace {(3 E W : /3j = 0,j ^ S}. 
Under ConditionUl the following statements hold: 

(a) If (3 lies in Bs and Aj;ai^(Vss) > t^{px) "^^^jes ^jj , then 13 is a global minimizer of Q{(3) 
in Bs- 

(b) If the sequence {/3™'} generated by the coordinate descent algorithm eventually lies in Bs 
and Amin(V55') > k(pa) niaxjg5 V^j, then Q{(3) has a unique global minimizer (3* in Bs 
and the sequence converges to (3* . 

The condition in part (a) is trivially satisfied for the Li-penalty. For the SCAD, MCP, and 
SICA, the condition can be satisfied with some S if the correlation among cova riates is not 
too st rong and the concavity of the penalty function is not too large. Following 



Fan and Lv 



(I2OIII ). under some mild regularity conditions we can further establish the global optimality 
of (3 on the union of all IS* [-dimensional coordinate subspaces of W . Although the global 
optimality is somewhat hard to guarantee for concave penalties, we remark that it is not 
required for achieving practically good performance of the regularized estimator. In fact, 
our theoretical arguments suggest that a sparse local solution should possess nice statistical 
properties, while the coordinate descent algorithm is likely to find such a solution by carefully 
following a solution path from the sparsest end. 

4.3 Selection of Tuning Parameters 

After a solution path has been produced, the optimal regularization parameter A can be 
chosen by M-fold cross-validation. The cross-validation score is defined as 

M 

CV(A) = ^E^^™n3^'"^\A)), 

m=l 

where L^'"^(-) is the loss function given in ([3]) computed from the mth part of the data, 
and f3 (A) is the estimate from the data with the mth part removed. The concave 
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penalties have on e additional param eter a to be tuned. For the SCAD penalty, a = 3.7 



was suggested by iFan and Lil (|2001| ) from a Bayesian perspective and is commonly used in 
the literature. Since the MCP is similar to SCAD, we also take a = 3.7. The choice of a 
for the SICA penalty requires a little more caution, since a small a that is often needed to 
yield a superior theoretical performance can sometimes cause the problem of computational 
instability. This can be clearly seen from the discussion following Theorem [3l a too small 
a would cause the convexity condition A(a~^ + a~^) < 1/2 to be violated. To solve this 
dilemma, we first compute a pilot solution path with a larger a suggested by the convexity 
condition, and then set a to a smaller value and recompute the solution path by taking the 
pilot solutions as warm starts. If needed, the above process can be repeated several times 
to gain further stability. Often we take a = 1 for the pilot solution and one or a few values 
toward zero for recomputing the solution. We find that, in the settings we have considered, 
the practical performance of our methods is not sensitive to these choices. In the more 



difficult setti ngs and if computing resources al 
benefits; see Mazumder. Friedman, and Hastie 



o w, a fine tuning on a may yield additional 



(120111 ) for more discussion on producing a 



solution surface along (A, a). 



5 Simulation Studies 



We conducted three simulation studies to evaluate the finite-sample performance of the 
proposed regularization methods with the Lasso, SCAD, MCP, and SICA penalties, and 
compare them to the oracle estimator whic h knew the true sparse model in advance. Since 



the elastic net (Enet) (IZou and Hastiell2005l ) is also a common choice for the penalty function 
and is known to outperform the Lasso when the important variables are highly correlated, 
we also include it in our comparisons. 

The first simulation study aims to examine the case where p is comparable to but smaller 
than n. We generated data from the model 

A(t|Z) = l+/3^Z, 

where Z has a multivariate normal distribution with mean zero and covariance matrix 
(p'*~"'')f,j=i ^'^d subject to /3qZ > —1, and /3o = (v'^, ■ ■ ■ ,v'^, 0)"^ with the pattern v = 
(1, 0, —1, 0, 0, 0)"^ repeated q times. We set p = 0.1 and 0.5, and g = 3 so that the sparsity 
dimension s = 6. We considered p = 50 and 100, and n = 200. The censoring time C has a 
uniform distribution U (0, Cq) with cq chosen to obtain a censoring rate about 25%. 

We first choose the optimal regularization parameter A by tenfold cross-validation and 
evaluate the performance of the resulting estimators by six measures. The first two measures 
quantify prediction performance: PEl is the prediction error based on the loss function L(-) 
(up to a constant), and PE2 is the L2 prediction error in the excess risk, ||Z^(/3 — /3o)l|2, 
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both computed from an independent test sample of size 500. For estimation accuracy, we 
report the L2-I0SS ||/3 — /3oll2 and Li-loss \\f3 — /3olli- The other two measures pertain to 
model selection consistency: and t^FN refer to the number of selected variables and 
the number of incorrectly excluded variables (false negatives), respectively. The means and 
standard deviations of each measure over 100 replicates are summarized in Table [T] 

[Table [U about here.] 

From Table [1], we see that the Lasso and Enet had very close performance in this setting 
and the concave penalties outperformed the Lasso and Enet in that they selected a sparser 
model with better prediction and estimation performance. As expected from their similarity, 
the SCAD and MCP had comparable performance, of which the latter selected a slightly 
sparser model than the former. The SICA performed best among the five methods, with a 
performance very close to that of the oracle estimator; in most cases, it was able to identify 
exactly the six important variables. 

It has been noted that prediction-based criteria such as cross-validation tend to include 



too many irrelevant variable s in Lasso-type procedures (IMeinshausen and Biihlmann 



2006 



Leng. Lin, and Wahball2006l ). which was also observed in our simulations. Thus, it is natural 
to wonder whether the concave penalties really improve the variable selection performance 
over the Lasso and Enet, if the most appropriate amount of regularization for each method 
can be selected. In other words, it is sensible to compare the performance of the best model 
that ever exists on the solution path, assuming that we were assisted by an oracle which knew 
the true sparse model. Since our goal is to identify a parsimonious model that includes as 
many relevant variables as possible, we recorded the maximum number of correctly selected 
variables among all models up to a certain size on the solution path and averaged it over all 
replicates. By definition, this measure is an increasing function of the (maximum) model size 
and characterizes the best possible performance that can be achieved by any model selection 
criterion. 

The results on the above defined performance measure for the first simulation study are 
shown in Figure [T] It is clear from the figure that the SICA outperformed the other methods 
in that it had a high chance to identify all six important variables immediately after the 
model size reached the true sparsity dimension. The MCP also exhibited a performance 
boost above the true sparsity level, while the boost for SCAD was more subtle and occurred 
at a later stage. These trends are magnified when the correlation among covariates increases. 
The Enet improved on the performance of the Lasso only marginally in this case. 

[Figure [T] about here.] 
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In the second simulation study, we compare the performance of various methods in the 
setting where p is much larger than n. We used the same model setup as before, except that 
p = 2000 and 5000, and n = 500. The performance measures over 50 replicates with tenfold 
cross-validation to choose the optimal A are shown in Table [2l which indicates the same trends 
as in Tabled! Note that in the most difficult setting, p = 5000 and p = 0.5, both the Lasso 
and Enet missed on average at least one important variable, whereas the concave penalties 
still included all six important variables. The variable selection performance assisted by an 
oracle is then compared in Figure [2l The SICA was still the winner in this case, followed by 
MCP. It is interesting to note that the Enet clearly outperformed the Lasso with moderate 
model size; this difference, however, was not notable in the cross-validated results. Since 
cross-validation seems to have less tolerance on false negatives than on false positives in 
these scenarios, its behavior is better explained by the tail of the curve: the SCAD had a 
slight performance boost on the tail, whereas the Lasso and Enet were very close to each 
other and both below the SCAD. 

[Table [2] about here.] 
[Figure [2] about here.] 

The third simulation study reflects the more challenging case where the true model is 
only approximately sparse and we wish to analyze the robustness of our methods against 
departure from the sparsity assumption. To this end, in the previous model with p = 2000 
and n = 500, we randomly chose 44 or 94 of the zero coefficients and perturbed them by 
U{0,e) with signs randomly placed, so that s = 50 or 100, respectively. This setting is 
intended to mimic a possible scenario in genetic studies where a relatively large number of 
genes have nonzero but weak effects, while a few key genes with strong effects stand out. 
To describe different levels of weak effects, we set e = 0.1 and 0.2. The dimensionality 
apparently exceeds the limit for exact recovery of all, strong and weak, signals; recall from 
condition (|T3|) that we require n ^ s^logp with constant and ri = 1. In this case, even 
the oracle estimator, which is based on all covariates with nonzero effects, does not perform 
well. Thus, our target here is to identify the few key variables and a small portion of the 
variables with weak effects. 

The results with tenfold cross-validation over 50 replicates are shown in Table [3l where 
the added performance measure #FN-S is the number of incorrectly excluded variables with 
strong effects. These results indicate that the SICA had the best performance in identifying 
the strong effects, because it aggressively seeks a sparse representation of the data and treats 
the variables with weak effects as noise variables. If the weak effects are also of interest, 
however, the SCAD and MCP performed better in that they selected more variables with 
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nonzero effects at the expense of picking a larger model. It is worthwhile to note that when 
e = 0.1, the SCAD, MCP, and SICA all had better prediction and estimation accuracy than 
the oracle estimator, which demonstrates the advantages of sparse modeling. The variable 
selection performance of all methods assisted by an oracle is depicted in Figure |3l from 
which we see that the benefits of the concave penalties remain intact. Moreover, as the 
perturbation threshold e increases, i.e., the model deviates further from sparsity, the SCAD 
and MCP showed less aggressive behavior than the SICA and were able to identify more 
variables with weak effects. 



[Table [3] about here.] 
[Figure E] about here.] 

Finally, we point out that there are also scenarios where the Enet tends to outperform 
the other methods; for instance, when many highly correlated variables have comparable 
effects and the sparsity dimension is intrinsically high, the L2 term in the Enet penalty will 
play a pivotal role in regularization. These situations, however, are not our focus in sparse 
modeling and we do not pursue further in this article. 

6 Real Data Analysis 

We illustrate the proposed methods by an applica tion to the diffuse large- B-cell lymphoma 



(DLBCL) data analyzed by lRosenwald et al.l ( 120021 ). This data set consists of gene expression 



measurements for 7399 genes and survival outcomes after chemotherapy on 240 patients. The 
median follow-up time was 2.8 years and 138 patients died during the study period. The aim 
of the study w as to formulate a molecular predictor of survival after chemotherapy for the 
disease. As in 



Rosenwald et al. 



( 120021 ). the data set was randomly divided into a training 
set of 160 patients and a test set of 80 patients. We then applied the Lasso, SCAD, MCP, 
SICA, and Enet methods to the training set and used tenfold cross-validation to choose the 
optimal regularization parameter. 

To assess the prediction performance of the resulting model, in addition to computing 
the prediction error based on the loss function L{-), we classified the test patients into a 
low-risk group and a high-risk group of equal size, according to the individual predicted 
excess risk (3 Z, and performed a two-sample log-rank test. Table S] reports the number of 
selected genes, prediction error, and p-value of the log-rank test for each method, and the 
estimated coefficients for selected genes are given in Table [51 These results suggest that all 
methods performed reasonably well in prediction, but the concave penalties selected sparser 
models than the Lasso and Enet. To see the similarity of the prediction results, we note that 
60 of the 80 test patients received the same risk classification from all five methods. The 
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selected genes also exhibit some consistency across different methods, although variability 
was observed in the cross-validation procedure due to the small sample size and ultra-high 
dimension ality. We remark that, to gain further stability, the idea of the sure independence 



screening (iFan and Lvll2008l ) can be used to reduce the dimensionality to a more manageable 
scale before applying the regularization methods, in which case the prediction performance 
of all methods tends to improve. 

[Table m about here.] 
[Table [5] about here.] 

7 Discussion 

We have proposed a class of regularization methods for simultaneous variable selection and 
estimation in the additive hazards model. The main message of this article is that regu- 
larization methods can be used for variable selection with censored survival data in high- 
dimensional settings under conditions that are parallel to those in the linear regression 
context. Moreover, we have shown that concave penalties can substantially improve on the 
Li method and yield sparser models with better prediction performance, as evident from our 
theoretical and numerical results. Although we have focused on the additive hazards model, 
the empirical process techniques used here are fairly general and can be easily adapted to 
other survival models; for example, a key step to establishing a high-dimensional theory for 
the Cox model under similar conditions is to characterize the concentration of the empirical 
information matrix around its population counterpart. 

The fact that our theoretical results allow the dimensionality to grow exponentially with 
the sample size has important implications. In practice, how high dimensionality the pro- 
posed methods can handle depends critically on the sample size and the correlation structure 
of the covariates. Variable selection in regression, in general, is a very difficult problem, and 
is even more challenging in the survival context. As a consequence, a relatively large sample 
size is essential to making reliable inference. In situations where the proposed methods may 
fail, it would be desirable to explore strategies that combine the strengths of a variety of 
approaches, and regularization methods can then be used as building blocks in such more 
powerful procedures. 

We have partly based our performance comparisons on the best model selected by an 
oracle. In practice, how to choose the optimal regularization parameter remains a challenging 
issue for the regularization methodology. Although cross-validation is often the choice we 
have to resort to and sometimes performs better than AIC or BIC-type criteria, it suffers 
from several drawbacks and limitations. When the dimensionality is exceedingly high and 
the amount of regularization is necessarily large, the cross-validation curve can easily blow 
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up and re sult in selecting too few variables. If variable selection is the sole purpose, stability 
selection ( Meinshausen and Biihlmannl 2010 ) and related methods could potentially provide 
more accurate error control. These problems will be interesting topics for future research. 



Appendix A: Proofs 

For clarity and readability, before proceeding to the proofs of our main results, we first 
present a lemma that provides optimality conditions for the regularization problem and 
establish a series of concentration inequalities that are essential to the main proofs. These 
results are also of independent interest. For notational simplicity, constants that are used in 
our proofs may vary from line to line. 

A.l Optimality Conditions 

The following lemma provides optimality conditions that characterize a local solution to the 
regularization problem (j4]) and will be needed in the proof of Theorem [H 

Lemma A.l (Characterization of the regularized estimator). Under ConditionUl (3 G MP 
is a strict local minimizer of problem (jlj) if the following conditions hold: 

V^0) - ApU|3aI) ° sgn(3;f) = 0, (A.l) 

||U;j.(3)|U<Ap'(0+), (A.2) 

and 

Amin(V;J;j) > A/€(pa;3i), (A.3) 

where o is the Hadamard (entrywise) product and the functions | ■ |, p^(-); CL'i^d sgn(-) are 
applied componentwise. 

Proof. We first consider the 1^4 [-dimensional subspace B = ^W: (3^^ = 0}. Condition 
(IA.3|) implies that the objective function Q{(3) in is strictly convex in a neighborhood of 
(3 in B. Then condition (lA.ip implies that /3 is a stationary point and hence a strict local 
minimizer of Q{I3) in the subspace B. 

It remains to show that, for any f3i eMJ'\B that lies in a sufficiently small neighborhood 
of (3, we have Q{f3i) > Q{(3). To this end, let (32 be the projection of f3i onto the subspace B. 
Since Q{(32) — Q{P) from the preceding paragraph, it suffices to show that Q{(3i) > Q{(32)- 
By the mean value theorem, we have 

Q{(3,)-Q{(32) = Yl ^^Pf^'^= E {-t^.(r)+Ap',(|/3;|)sgn(/3;)}/3y, (A.4) 

where f3* = (/3J", ■ ■ ■ ,(3*)'^ is a point on the line segment between (3^ = (/3ii, ■ ■ ■ ,/3ip)'^ and 
/32- It follows from condition (1A.2P and continuity that \Uj{l3*)\ < \p\{\f3*\) sgn{P*) for all 
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j G A'^, provided that /3i, and hence f3* , is sufficiently close to /3. Using this fact and that 
sgn(/9*) = sgn(/3ij), we see that each term in flA.4p is positive, and thus Q{(3i) > (^(/Jg)- 
This completes the proof. 

A. 2 Concentration Inequalities 

The major complexity in our proofs lies in characterizing the concentration of the large 
matrices V and W and vector U(/3o). Due to the high dimensionality and dependency 
among entries, this is not a direct consequence of classical random matrix theory. Also, 
since each entry of the stochastic matrices or vector is not a sum of independent terms, the 
usual Hoeffding's inequality is not applicable. Hence, to establish the needed concentration 
inequalities, we rely on a functional Hoeffding-type inequality and some maximal inequalities 
for empirical processes as our primary mathematical tools. 

We adopt the standard empirical process notation. For any measurable function /, 
we denote by P„/ and Pf the expectations of / under the empirical measure P„ and the 
probability measure P, respectively. Let || ■ \\p^r denote the usual Lr(P)-norm. The "size" of 
a class J-" of functions is measured by the bracketing number N[]{e, J-", Lr{P)), the minimum 
number of e-brackets in Lr{P) needed to cover J-", and the covering number N{e, J-", L2{Q)), 
the minimum number of L2(Q)-balls of radius e needed to cover J-". The logarithms of the 
bracketing number and covering number are called entropy with bracketing and entropy, 
respectively. The bracketing integral and uniform entropy integral are defined as 

r.<5 



J[]{6,J^,L2{P)) = ^logN[]{e,J^,L2{P))de 

J(5,^,L2)= / ^ /log sup Nie\\F\\Q^2, ^, HQ)) de, 
Jo \ Q 



and 



respectively, where F is an envelope of J-', i.e., |/| < F for all / G J-", and the supremum is 
taken over all probability na e asure s Q with ||-F||q7- > 0- We refer the unfamiliar reader to 



Chapter 19 of Ivan der VaartI (119981 ) or Chapter 2 of iKosorokl (j2008[ ) for more definitions and 
concepts. 

Define S^''\t) = ^"=1 ^■(i)Zj(t)'^'^, k = 0,1,2, which are the sample counterparts 
of s('=)(t) defined in and recall that Z(t) = S^^\t)/S^'^\t). We begin with the following 
lemma, on which the other inequalities will be based. 

Lemma A. 2 (Concentration of S*-^''(-), /c = 0, 1, 2). Under Condition\^ there exist constants 
C,K > such that 



p( sup \S^^\t) - s^^\t)\ > Cn-^/^{l + x)^ < exp(-irx'). 



(A.5) 
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and 



p( sup \S^'\t) - s^'\t)\ > Cn-'/\l + x)\nL] < expi-Kx^L^), (A.6) 
\te[o,T] J 

pf sup \Sifit) - sffit)\ > Cn-'/\l + x) I < expi-Kxyi^), (A.7) 
Vte[o,r] / 



for all X > and i,j 



p, where Sf'\-) is the jth component ofS^'^^{-) and 



(2), 
ij \ 



IS 



the {i,j)th entry of the matrix S^^''(-). 



Proof. We only show (]A.6|) . and the other two inequalities follow similarly. Write Rj 
suptg[o,r] I Sf'^ (t) — s^-^^ ( t ) I ■ Th e main idea is to apply a functional Hoeffding-type inequality, 
Theorem 9 of 



Massart 



fl20o"o[ ). To this end, we need to control the term ERj. 



We first show that the class of functions {Y(t)Zj(t) : t G [0,r]} has bounded uniform 
entropy integral. Since a function of bounded variatior i can be expres sed as the difference of 
two increasing functions, it follows from Lemma 9.10 of lKosoroM (120081 ) that Zj = {Zj{t) : t G 
[0, r]| is a VC-hull class associate d with a VC class of index 2. Then, by Corollary 2.6.12 of 



van der Vaart and Wellnerl (Il996l ). the entropy of Zj satisfies \ogN{e\\F\\Q^2, 2j, L2{Q)) < 
K'{l/e) for some constant i^' > 0, and hence Zj has the uniform entropy integral 



J{l,Zj,L2)< [ ^/K'{l/e)de < oo. 



Also, by Example 19.16 of 



van der Vaart 



fll998h . y = {Y{t): t G [0,r]} i s a VC c 



hence has bounded uniform entropy integral. Thus, by Theorem 9.15 of 
yZj has bounded uniform entropy Integra 
Now an application of Lemma 19.38 of 



Kosorok 



a ss an d 
f|2008k 



van der Vaart 



mm gives 

ERj < C'n^'/^J{l,yZj,L2)\\F\\p^2 < Cn'^/^ 

for some constants C',C > 0, wh e re the envelope F is taken as sup^g^o.r] It 
follows from Theorem 9 of 



Massart 



(12000f ) that 

P{Rj > Cn-^/\l + x) I ^l) < P{Rj > ERj + Cn-'^/'^x \ ^l) < exp(-KxV^^) 
for some constant K > 0, which concludes the proof. 

Lemma A. 3 (Concentration of U(/3g)). Under Conditions\^ there exist constants C,D, 
K > such that 



P{\Uj{(3Q)\ > Cn-^/\l + x)\nL) < Dexp(^-K 



X An 



for all X > and j = 1, 
Proof. We first write 



,p, where t/j(/3o) is the jth component o/U(/3o). 



Uj{l3o) = Pn / {Zj{t) - Zj{t)} dM{t) = P„ / Z,{t) dM{t) - P„ / Z,{t) dM{t) =Ti-T2 
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where Zj{-) is the jth component of Z(-). Since term Ti is an i-i-d. su m of mean-zero 
random variables, an apphcation of Hoeffding's inequahty (IHoeffdingl Il963l ) gives -P(|Ti| > 
I VIl) < 2 exp(— i^x^/L'^) for some c onsta nt K > 0. 



We will apply Theorem 9 of iMassartI (|2000[ ) to bound term T2. Note that, from flA.SP 
and flA.6p in Lemma [A. 2 1 we have 



sup 

tG[0,T 



Ml 



it)\>6] < exp{-Kn) 



and 



pf sup \S^^\t) - s^.p{t)\ >5\nL] < eM-Kn/L^), 
Vte[o,rl / 



for some constant 5 > and j = 1, ■ ■ ■ ,p. Since these two tail probabilities are bounded by 
exp(— fCn/L^) for L > 1, it suffices to consider the case where supjg[o,r] — < 5 



and suptg[o,^] \sf\t) 



sf \t)\ < (5 for J = 1, 



,(1) 



,p. Write 



,(0) 



Since S^'^^-) and s^'^^-) are bounded away from zero by Condition [2]^ii), the above repre- 
sentation implies that sup^^jg,.] \Zj(t) — ej(t)\ < 6' for some constant 6' > 0. Let J-'j denote 
the class of functions / : [0, r] — > M that are of uniformly bounded variation and satisfy 
supjg[o,r] \ fit) — ej(t)| < 6'. Define the class of functions Qj = {Jq fit) dM{t): f G J^j} and 
Gj = supggg^, |(P„ — P)g\ = supggg^ l^ndl- We need to control EGj. 

By constructing || • ||oo-balls centered at piecewise constant functions on a regular grid, 
one can show that the covering number of the class J-'j satisfies N{e,J-'j, \\ ■ ||oo) < (K/e)^'^^ 
for some constants K,K' > 0. Note also that, for any /i, /2 G J^j, 



h{t)dM{t) 



h{t) dM{t) 



< sup |/i(s) 



Ms)\ / \dMit)\. 
Jo 



By Theorem 2.7.11 of Ivan der Vaart and Wellnerl (119961 ) . the bracketing number of the class 
Qj satisfies N[pe\\F\\p^2,Gj, L2{P)) < N{e,J^j, \\ ■ lU) < (K/e)'''/', where F = £ \dM{t)\. 
Hence, Qj has bounded bracketing integ ral. 



An application of Corollary 19.35 of 



van der Vaart 



mm yields that 

EG, < C'n-'/^J[]{\\G\\p,2,Qj,L2iP)) < Cn'^l^ 
for some constants C\C > 0, where G is an envelope of Qj. We then apply Theorem 9 of 



Massart 



( 200' 



J to obtain P(|T2| > Gn~^/'^{1 + x) \ fi^) < exp^-Kx"^ / L^) for some constant 



i^' > 0. Putting the bounds for Ti and T2 together leads to the desired inequality, thus 
completing the proof. 



Lemma A. 4 (Concentration of V). Under Condition\^ there exist constants C, D, K > 
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such that 



for all X > and i, j = 1, ■ ■ ■ ,p, where Vij and Dij are the {i,j)th entries of the matrices V 
and D, respectively. 

Proof. We first write 

Clearly, f OTfl) in Lemma implies that P(|ri| > Cn'^^^{l + x) | fi^) < exp^-Kx"^ /L*). 
To bound term T2, write 

By the same arguments as in the proof of Lemma IA.3t it suffices to consider the case where 
supig[o,r] — < 6 and sup^gjo^,-] \Sj^\t) — s^j^\t)\ < 6 for some constant 6 > and 

j = 1, ■ • ■ ,p. From the above representation and ( lA.SI) and ( ]A.6p in Lemma \A.2\ it follows 
that P{\T2\ > Cn~^/^{l + x) \ ^l) < 3exp{-Kx^/L^). Combining the bounds for Ti and T2 
yields the desired inequality and concludes the proof. 

Lemma A. 5 (Concentration of W). Under ConditionlE, there exist constants C, D, K > 
such that 

P{m,-E,,\>Cn-'/\l+x)\nL)<Dexp(^-K^^^'^ 

for all X > and i,j = 1, ■ ■ ■ ,p, where Wij and Sj^ are the {i,j)th entries of the matrices 
W and S, respectively. 

Proof. We first write 

Wij-J:ij = {F^-P) [ Z,{t)Z,{t)dN{t) 

Jo 

P„, / Zi{t)Zj{t)dN{t)-P I ei{t)Zj{t)dN{t) 
Jo Jo 

F„. [ Zi{t)Zj{t)dN{t)-P I Zi{t)ej{t)dN{t) 
Jo Jo 

+ |p„^ Z,{t)Zj{t)dN{t)-P e,{t)e,{t)dN{t)^ 
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Since term Ti is an i.i.d. sum, an application of Hoeffding's inequality gives -P(|Ti| > 
n~^^'^x I VIl) < 2exp(— fCx^/L'*). To bound term T2, write 

T2 = (P„ - P) / Z,{t)Z,{t) dN{t) + P / {Z,{t) - e,{t)}Z,{t) dN{t) = T21 + T22. 
Jo Jo 

Term T21 can be bounded similarly as term T2 in the proof of Lemma [A. 3 1 Also, note that 

1^22! < sup |Zi(s)-ei(s)|P / \Zj{t)\dN{t). 

sG[0,r] Jo 

Then it follows from flA.Sp and Lemma IA.2I that 

Pi\T22\>Cn-'^'il + x)\nL)<Dexp(^-K^^-^y 

We can bound terms T3 and T4 similarly and thus obtain the desired inequality. This 
completes the proof. 

A. 3 Proof of Lemma [1] 

By the union bound and Lemma IA.41 we have 



'AA\\ 00 



> 



2ip 



= p( maxS~^ \ Vii — Dij\ > — 

ieA jrgA ^ ^ 



2if 

1 

2ips 



i€A jeA 



A, I > 



< s^Dexp{ -K— 



n 



2ip 



A 1 



By an error bound for matrix inversion (IHorn and Johnson 
1/(2^9), then 



1985 



p. 336), if ||V^a-Daa||oo < 



. . ~ \ -L 



1 - V'IIVaa - DaaiIoo 
- II 

AA -^AA II 00 



and hence llV;^;^||oo < ||D;^^||oo + ||V;^^ - D^^^l^Hoo = ^ + ||V;^]^ - D~^||oo < 2ip. Then 



probability bound (JTj) follows. 

To show probability bound ([H]), we write 

^ A'^A^J^A ~ Da^aD^^ = iy A'^A — Da<=a)V^^ + Dyic^(V^^ — D^^) 

= (Va=a — Da=a)V^^ — D^c^D^^(Vaa — Daa)V^^ 
= Ti - T2. 

Similarly as before, by the union bound and Lemma IA.4t we have 



P 



IV 



A'^A 



D 



A'^A 00 



> 



1 fa p'{0+) 
2^\a p',{d) 



A ( 
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< {p — s)sD exp 



-K 



This, along with ([7]), gives 
«p'(0+ 



P 



U^l oo > 



4 p',(^;) 



< P 



IV 



1 rap'(0+) 



2^14 



A ( -n^ 
2 



+ ^(llv:4illoo>2</^lfii) 



< (p — s)sZ^ exp 



-K 



n \ yy{dY^^niY ^^^^ 



Also, by Condition |3] and ([7]), we have 
■ap'(0+ 



P 



|^2||oo > 
< P 



A ( V 



4 p^(d) 



n f \ 



A 1 



2if\A{l-a) 2 



a 1 

A 



+ P{\\^A\\\oc>2ip\nL) 



< s^Dexpi -K— 



n 



A 1 



Putting the bounds for Ti and T2 together, we obtain 

'ap'{0+ 



P 



A^A^ AA ~ D^c^D^^lloo 



> 



2 p\id) 



A (cn^) I fii 



< — exp 



L4 [ 



s^Dexp<^ -^T7 



n / 1 



Vy^a^a 



A 1 



Then probability bound (|8]) follows from Condition [3] and the triangle inequality. 



Finally, to show probability bound ([9]), by the Hoffman- Wielandt inequality (IHorn and Johnson 



19851 ). we have 



lAmin(V^A) -Amin(D^^)| < | ^ | A(,) ( V^^) - A(,) (D^^) | ^ | 

^ 7=1 ^ 



1/2 



< IIV.^-D 



AA — ^AA\\F^ 



where Aq)(-) denotes the j'th smallest eigenvalue and || ■ ||f is the Frobenius norm. Then it 
follows from the union bound and Lemma IA.4I that 

P(|A^in(VAA) - A,,i„(DAA)| >Ii\^l) 

\,j£A ' 

< Y: p(\V, - A,| > ^ I n,) < s'De.J-K^M A l) |, 
which, by the definition of p, implies (Q. This concludes the proof. 
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A. 4 Proof of Theorem [T] 

The main idea of the proof is to first define an event with a high probabihty and then 
analyze the behavior of the regularized estimator f3 conditional on that event. The first part 
involves concentration inequalities developed in Section IA.2I and Lemma [H while the second 
part involves nonprobabilistic arguments based on Lemma lA.ll 
First, by the union bound and Lemma [A. 3t we have 

1 



^ I|Ua(/3o) 



> 



a 



2cn'y 4 



Ap'(o+) I 



Similarly, we have 



n 



l] <sDexpi-K-A— Al 



p(^||Uac(/3o)IIoo > ^Ap'(0+) I fiz.) <{P- s)Dexp|-ir-^(A2 A 
Also, Condition EJ^iii) and the union bound imply that 

Pi^l) < sup \Zj{t)\ > l] < pDexpi-KU). 

~[ Vie[o,r] / 

It follows from ( 1A.9I) -( IA.11[) and Lemma [1] that with probability at least 



1 — {p — s)sD exp 



s^D exp 



-K 



A 1 



n 



- sDexpi -K—l — A 1 



n 



- (p- s)Dexp<J -K^{X^ A 1) \ ~pDexp{-KU 



the following inequalities hold: 



|U^(/3o)lloo<77^?Ap'(0+), 



a 



2cn' 4 



|Ua=(/3o)||oo<-Ap'(0+) 



1 - 



«\p'(0+) 



2) p'.id) 



A (2cn^), 



and 



(A.9) 



(AlO) 



(All) 



(A.12) 



(A13) 



-^min 

From now on, we condition on the event that these inequalities hold. It suffices to find a 
/3 G that satisfies all the optimality conditions in Lemma lA.ll and the desired properties. 
To this end, take /3^c = 0, and we will first determine by solving equation (lA.ip . Since 
U(/3) = b — V/3, we have Ua(/3) = U^(/3o) — Vaa(/3^ — (^oa)- Substituting this into the 
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equation U^(3) - Xp'xHPaI) ° ^St^0a) = gives 

3a - f3oA = V-Ai^AW - Ap',(|3^|) o sgn(3^)}. (A.14) 

Define the function / : W -> by = /3oA + V^i{UA(/3o)- VA(l^l)°sgn(0)}, and let /C 
denote the hypercube {0 G M"*: \\6 — (^qaWoo < ciLpXp'{0+)}. By the inequahties estabhshed 
before, for G /C, we have 

WfiO) - /3o^l|oo < l|Vli|U{||U^(/3o)lloo + Ap'(0+)} 

- ^^{2^^!^'"'^°+^ + V(0+)| < ci^Ap'(0+), 

i.e., f{JC) C /C. Also, condition ( lT2l) implies that for G /C, ||0 — /3o^||oo < c?, so that 
sgn(0) = sgn(/3oA)- Thus, in view of Condition [H / is a continuous function on the convex, 
compact hypercube /C. An application of Brouwer's fixed point theorem yields that equation 
(IA.14P has a solution in /C. Moreover, sgn(/3^) = sgn(/3Q^) and hence A = A. Thus, 
we have found a /3 G that satisfies ( lA.ip and the desired properties. It remains to check 
conditions f[01) and ( 1X31) . 

To verify that f3 satisfies (lA.2p . we write 

Uac(3) = Uac(/3o) - yA^A0A - /3oa) 

= Ua.(/3o) - Va^a^AI^aW - Xp'xilM ° sgn(3A)}, 

where we have substituted ( 1A.14I) . Since — /3oaIIoo < c^, we have ||/3aIIoo = ||/3oa + (/3a~ 
/3oa)I|oo > ||/3oaIIoo - ||3a -/3oaI|oo >2d-d = d. The triangle inequality, concavity of pa(-)) 
and the inequalities established before together imply that 

||Uac(3)||oo < ||Uac(/3o)||oo + ||V^MV^i|U{||U^(/3o)|U + Apl(rf)} 

< ^Ap'(0+) + 2cn^^^Ap'(0-,) + (l - f )f^Apl(.) 

= ^Ap'(0+) + ^Ap'(0+) + (1 - |)Ap'(0+) 
= Ap'(0+). 

Since ||/3^ — /^oaIIoo < d, it follows from ( ]A.13I) that Amin(VyiA) > Akq > Xn{px', I^a)-, ^"^^ 
hence ( lA.Sp is satisfied. Finally, we choose L by matching the exponential terms in ( 1A.12P 
and note that the probability tends to 1 by conditions ffTOl) and ffTTl) . This completes the 
proof. 
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A. 5 Proof of Theorem [2] 

The proof Theorem [2] is based on the proof of Theorem [1] in Section |A.4[ First, by the same 
arguments as for deriving probabihty bound in Lemma [H one can obtain 

P(Amin(VAA) < Al/2 I ^l) 

= P(|A^i„(V^^) - All > Ai/2 I ^l) < s^Dexp\ -K—{-^ A 1 
Thus, with probabihty at least 



1 - s^Dexp{ -K—[ ^ A 1 ) \ -pDexY>{-KU), (A. 15) 



it holds that Is^miniy aa) > Ai/2, and hence 

WyAh = l/^min(VAA) < 2/A,. (A.16) 

From now on, we condition on the intersection of the event that flA.16p holds and the 
event defined in the proof of Theorem [H such an event still has a high probability. Since 
sparsity has been established in Theorem [H we only need to show the asymptotic normality. 
By substituting (]A.14p . we can write 

= V^u^S^y^U^(/3o) + V^u^^-2l'BAAiyA\ - Daa)Ua(/3o) 

= Ti+T2-T3. 
We first consider term T2. Since ||u||2 = 1, we have 

IT2I < v^||S;4y'D^A||2||D;4]i||2||V^^-D^A||2||V^],||2||UA(/3o)l|2. 

It follows from Lemma IA.4I that 

||Vaa - Daa||2 < ||Vaa - D^aIIf < I s^max - Ajl = sOp{n ^/^), 

V «jga j 

and similarly, by Lemma IAl3l ||U^(/3o)||2 = V^Op(n^^/^). Using also ||S^y^DyiA||2 = A3 
||D^^||2 = 1/Ai, and (IA.16I) . we obtain 

9,3/2 

IT2I < v^A3 ^/'Ar^50,(n-i/2)2A-iy^O,(n-i/2) = -^0,(^-^/2) = o,(l), 

A?A3/ 

by the third condition in f[T^ . We then consider term T3. The concavity of px{-), (IA.16p . 
and condition (ITB]) lead to 



IT3I < v/^||S-/^D^^||2||V- ||2Av/^pl(rf) < V^lVp ^ ^ 

A1A3 
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It remains to show that term Ti is asymptotically normal. Note that 

It follows from Lemma [A. 5 1 that ||WyiA — S^yilh = sOp{n^^^'^). Then the second term in the 
above display is bounded by 

in view of the second condition in (|T5l) . An application of the martingale central limit 
theorem ( Andersen and GillllQsd ) yields that Ti is asymptotically standard normal. Finally, 
we choose the optimal L in (IA.15I) . note that the probability tends to 1 by the first condition 
in ( IT^ , and combine it with the probability in Theorem [TJ This concludes the proof. 

A.6 Proof of Theorem [3] 

Since the sequence of objective functions {Q{f3^)} is decreasing, we see that the sequence 
{(5(/3'")} lies in a compact set of M. This entails that {(3"^} is bounded, since any \f3j\ — i- 00 
would imply that Q{I3) — ?■ 00 by the assumption Vjj 7^ for all j. Denote by = 

To show part (a), let (3* be a cluster point of {/3™'} and {/S"'*} a subsequence of {/S™} 
that converges to (3* . We first prove a claim that if (3^'' — (3^''^^ — )■ 0, then (3* is a local 
minimizer of Q{(3). Denote by df{^; a) the directional derivative of a function / along the 
direction ck. For any 6 = (6*1, ■ ■ ■ ,6p) E M^, we have 

dQif3*; 0) = J2 + E ^ndPxW;-^ = E ^^■^^■)' (A'^^) 

j=i Pi j=i j=i 

where /3* is the jth component of (3* and e^ is the p-vector with 1 at the jth compo- 
nent and elsewhere. Since minimizes ■ ■ ■ , ■, Pj^i'^, ■ ■ ■ , /3^''~'^), we have 
dQiaJ'^e^ej) > 0. Since /3""' - ^ 0, we have lim^^oo = Hmk^^ (3""^ = (3* 
and hence Ymvk^^ aj ' '' = ( 3* . It then follows from the upper semicontinuity of directional 
derivatives ( jBertsekasI 119991 ) that 

dQ{[3*-ejej) > limsupag(a™^%ej) > 

fe— )-CXD 

for all j. In view of (OTTTl) . we have 6) > for all e W, so that /3* is a local 

minimizer of Q{(3). It remains to show that — /3'"''~^ — )► 0. 

In fact, we will show that /3™ — — )■ 0. Consider the update from Q;™:^ to a™. The 
condition ^(pa) < 1 implies that Q{(3) is strictly convex in (3j, so that 

for every subgradient 9 of g(/3f , ■ ■ ■ , /3™ 1, -, (3^-^\ ■ ■ ■ , at /3f , where cq = l-/t(j5A) > 
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(jDem'yanov and Vasirevlll985[ ). The optimality of /3™ entails that is a subgradient and 
hence 

Adding both sides over j = 1, ■ ■ ■ , p and m = 1, ■ ■ ■ , M yields 

M 
m=l 

Noting that the sequence {Q(/3"')} is bounded, we have X]m=i ll/^"*"^ ^ f^^Wi < ^5 so that 
_ pm-i _^ Q pg^^^ (^^^^ proved. 

To show part (b), we use a contradiction argument. Suppose that the sequence {f3^} does 
not converge to (3* . Then there exists a subsequence of {/S'"} such that — /3* II2 > 

e for some e > 0. Since {Z?™*} is bounded, by taking a further subsequence if necessary, 
we can assume that converges to a point Clearly G /C since /C is closed, 

and 11/3** — (S*\\2 > It follows from part (a) that (3** is a local minimizer of Q{/3). This 
contradicts the assumption that /3* is the unique local minimizer in /C and proves part (b). 

A. 7 Proof of Theorem |3] 

The inequality Amin(V5S') > n{p\) maxj^s^jj in part (a) ensures that Q{(3) is convex on the 
subspace Bs, from which the restricted global optimality follows. Part (a) is proved. 

To show part (b), note first that the strict inequality AmmlVg^) > n{px) maXjfzsVjj 
implies strict convexity and hence the existence of a unique global minimizer f3* of Q{I3) 
on Bs- Also, since AminCV^s) < min^g^ V^j < maxj^sVjj, we have n{p\) < 1. It then 
follows from Theorem [3] and the boundedness of the sequence {/3™} that the sequence {(3"^} 
converges to f3*. This proves part (b). 

Appendix B: SICA-Penalized Least Squares in One Dimension 

In this appendix, we present an analytic form of the SICA-regularized estimator in one 
dimension. Consider the one-dimensional SICA-penalized least-squares problem 



^=argmin(^(^^-^^o)'+PA(|^^|)|, 
em ) 



where 6*0 G M and p\{-) = Ap(-) with the SICA penalty p(-) given by To find the nonzero 
stationary points of the objective function, it is easy to derive from the derivative equation 
that we need to solve the cubic equation + C2t^ + cit + cq = for the positive roots, where 
C2 = 2a — \9o\, ci = a? — 2016*01, and cq = Aa(a + 1) — a^|6'o|. Let Q = (03 — 3ci)/9 and 
R = (2c| — 9ciC2 + 27co)/54. If < R^, the cubic equation either has a unique, negative root 
or, in addition, has a real root corresponding to an inflection point; hence 6 = 0. Otherwise, 
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compute the two largest roots 

ti = -2^yQcOs{ J ~ y < ^2 = -2^yQcOs{ 1 - — , 

where a — arccos(i?/ y^Q^)- If > 0, then ti is a local maximum and t2 is a local minimum; 
by comparing the values of the objective function at and ^2, we have that 9 — sgn(^o)^2 if 
^2/2 + A(a + l)/(a + ^2) < ^0, and ^ = otherwise. If < and t2 > 0, then t2 is a local 
minimum and 9 — sgn(^o)^2- Otherwise, the cubic equation has no positive roots and ^ = 0. 
In summary, 9 = sgn(^o)^2 if > R^, h > 0, and t2/2 + A(a + l)/(a + ^2) < ^0, or 
> R"^, ti < 0, and ^2 > 0; otherwise ^ = 0. This gives a complete characterization of the 
SICA solution 9 in one dimension. 
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Table 1: Results for various methods in the first simulation study with n — 200, s — 6, and censoring rate about 25%. Values 
shown are means (standard deviations) of each performance measure over 100 replicates 
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Table 2: Results for various methods in the second simulation study with n — 500, s = 6, and censoring rate about 25%. Values 
shown are means (standard deviations) of each performance measure over 50 replicates 
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Table 3: Results for various methods in the third simulation study with n = 500, p = 2000, and censoring rate about 25%. 
Values shown are means (standard deviations) of each performance measure over 50 replicates. The oracle method is based on 
all nonzero (strong and weak) effects 
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Table 4: Results for various methods applied to the DLBCL data 
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Table 5: Estimated coefficients for selected genes in the DLBCL data by various methods 
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Figure 1: Variable selection performance for various methods in the first simulation study with 
•p = 100. The vertical line indicates the true sparsity dimension. 
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Figure 2: Variable selection performance for various methods in the second simulation study with 
p = 0.5. The vertical line indicates the true sparsity dimension. 
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Figure 3: Variable selection performance for various methods in the third simulation study. The 
vertical lines indicate the sparsity level with strong effects (solid) and the sparsity level with all 
nonzero effects (dashed). 



41 



